Abstract



We analyze in detail some implementations of a challenging, yet simple application: CERN's
calorimeter. We try both general purpose computer architectures (single and multi processors,
SIMD and MIMD), and special purpose electronics (full-custom, gate-array, FPGA) on the
problem.

All measures are expressed in a single common unit for computing power: the Gbops¹. It
applies to all forms of digital processors, and across technologies. What's more, Noyce's thesis
provides a reliable way to extrapolate Gbops benchmarks through future time, say up to year
2001.

The quantitative result of our analysis shows that special purpose processing is much more
efficient than general purpose processing, on our specific problem. We show how to map the
calorimeter on a programmable active memory PAM², at performance and cost comparable to
those of fully dedicated implementations: orders of magnitude faster than any general purpose
implementation, in 1992. We argue that this current computational power advantage for PAM
technology will increase with time.

Finally, we discuss how to program such novel virtual PAM computers in the 2Z language, for 
very large synchronous designs. 

Resume 

We analyze in detail the implementations of one application, CERN's calorimeter. Although
simple to state, this problem requires a large amount of computing power. We consider both
programmable computers, with one or several processors, SIMD as well as MIMD, and
digital hardware specialized for our application, across its implementation technologies:
full-custom, gate-array, FPGA.

We introduce a single measure, the Gbops, in which all forms of computing power are
expressed. Moreover, Noyce's thesis gives us a solid basis for extrapolating our measurements
into the future, say up to 2001.

The quantitative result of this analysis is that dedicated hardware processing of the calorimeter
is much more efficient than its software processing. We show how to implement the
calorimeter on a programmable active memory PAM, at a speed and cost comparable
to those of the other dedicated implementations: orders of magnitude faster than all
programmed solutions in 1992. We argue that this computing power advantage
for PAM technology will increase with time.

Finally, we present a way to program the calorimeter on a PAM, in the 2Z language
for large synchronous systems.

1 10^9 binary operations per second
2 large array of configurable logic

1 Noyce's Thesis



Since the advent of modern digital computers, we have lived through 40 years of an exponential 
growth which is unique in the technical history of mankind. 



Thesis 1 (Noyce) The computing power per unit doubles every year. 



This was first pointed out, in an equivalent form, by R. Noyce in the early sixties. Circuit 
technology integrates many contributions: they arise from almost every area of modern science 
and technology, spanning the range from solid state physics to digital computer architecture 
and programming languages. 

Yet, as Noyce observed, the complex combined cumulative effect of all these isolated and
discrete advances is as if the feature size of our circuit manufacturing technology was simply
shrinking linearly with time. As far as experimental evidence goes, and it is plentiful, the
average shrink factor per year is about $\alpha \approx 1.25$. As documented further by C. Thacker, the
shrink factor has remained statistically steady for over 30 years, and we have every reason to
believe that it will keep doing so in the near future, say up to year 2001 (see [T92]).

Let G (a natural number) be the number of logic gates which one can effectively fit within a 
unit area by the end of year y, and F (in Hertz) be the frequency at which one can reliably 
operate such gates. The computing power is the product of these two figures: 

$P = G \times F$. (1)




Figure 1: Growth rates of G (number of storage bits), F (operating frequency) and P = GF 
(computing power) for static RAM technology 

By the end of the next technology year $y' = y + 1$, the corresponding figures in (1) are $G' = \alpha^2 G$
and $F' = \alpha F$. Since $\alpha \approx \sqrt[3]{2}$, we find that $P' = 2P$, which is how we stated Noyce's thesis.

Keep in mind that it is just an observation about human science; it is neither a law from physics,
nor a theorem from mathematics. It has to do with economic and technological questions, such
as: Is GaAs faster than ECL? Will BiCMOS take over CMOS? At what cost?

2 Summary 

Noyce's thesis provides a nice and simple model against which we attempt to analyze the 
impact of time, i.e. technology, on computer architecture. For this purpose, we start from a 
single application: CERN's calorimeter. 

• The problem is simple enough to be fully stated in Section 3. Its large computing 
requirements are analyzed, step by step in Section 5. 

• It is part of a series of benchmarks put forward by CERN³ in [B&al92]. The goal
is to measure the performance of various computer architectures, in order to build 
the electronics required for the Large Hadron Collider LHC, before the turn of the 
millennium. 

• It is challenging, and well documented: [B&al92] provides benchmarks for a dozen 
electronic boxes, including most of the fastest current computers, on the calorimeter and 
other problems. 

Our object is to complement CERN's experimental benchmarks by a convergent, and more
theoretical analysis of the calorimeter. We use it, in conjunction with Noyce's thesis, in order
to make some predictions regarding the future.

We try two types of implementations, for solving our problem. 

The first type is representative of the computing power achieved by general purpose computer
architectures on the calorimeter: this year's fastest computer on a chip, compared to both
massively and moderately parallel implementations. We analyze the cycle time required for
such machines, and predict a year when it should become technologically feasible to implement
the calorimeter at speed, on each.

The second type is representative of the computing power delivered by special purpose digital
architectures, specifically designed to perform the calorimeter's computation. An implementation
in PAM technology is presented in Section 7.2. It is representative of three related design
methodologies: full-custom, gate-array and field programmable gate array FPGA.



3 Centre pour l'Etude des Reactions Nucleaires, in Geneva, Switzerland

Figure 2: CERN's view of ATLAS/LHC



We apply the same evaluation method to all cases: 

1. First, we assess the theoretical computing power of the machine; all measurements for
computing power are expressed in a single common unit, the Gbops.

2. Next, we analyze the actual computing power, as measured by running the machine on 
the calorimeter. 

3 CERN's Calorimeter 

The function of the calorimeter is to identify the position and most likely nature of a particle
which traverses a digitized RDI, a square $S = \{(i,j) : 0 \le i,j < 20\}$. Within the LHC, energy
sampling occurs at the rate of 100 kHz, i.e. every 10 μs.

The input is a pair of energy maps ($E'[i,j]$, $E''[i,j]$ for $(i,j) \in S$) providing the line-by-line responses
from two analog detectors, digitized down to 16 bits. The average input rate is 160MB/s,
presented on two channels (32b, 20MHz each). The analysis of event $(E', E'')$ is done by
computing:

E The pixel-wise sum: $E = E' + E''$.

S The total energy: $S = \sum_{(i,j) \in S} E[i,j]$.

M The maximum energy: $M = E[i_m, j_m] = \max_{(i,j) \in S} E[i,j]$.

O The first statistical moment: $O = \sum_{(i,j) \in S} r_m[i,j] \times E[i,j]$,
centered at E's maximum. Here, $r_m[i,j]$ is a tabulated 8b integer approximation of the
distance $\sqrt{(i - i_m)^2 + (j - j_m)^2}$.

P The peak energy: $P = \sum_{(i,j) \in S} p_m[i,j] \times E[i,j]$;
here $p_m[i,j]$ is one if $|i - i_m| + |j - j_m| < 2$, zero otherwise.

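D The final discriminant: $D = \mathrm{sign}(\alpha\, S\, O - \beta\, (S - M)\, P)$, written here in the
fraction-free form evaluated by the decision unit of Section 8.

The final discrimination between an electron and a hadronic jet is based on computing
the sign of D, for some (experimentally determined) suitable 16b integers $\alpha$ and $\beta$.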

The whole computation is carried out with 16-bit integers. The output rate is only 100 kb/s,
one decision bit per energy pair $(E', E'')$. The computation of the maximum implies that we
have to buffer a full energy map, between steps E, M and steps O, P, D.
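The six steps can be sketched as a plain software model. This is only an illustration, not the paper's implementation: the function name `calorimeter` is hypothetical, the distance table is assumed to be the rounded Euclidean distance, ties on the maximum are assumed to go to the first pixel found, and the discriminant is read from the OutputUnit code of Section 8 as a·S·O − b·(S − M)·P. Word widths (16b, 48b) are ignored, since Python integers are unbounded.

```python
import math

N = 20  # side of the square S: 400 pixels per energy map


def calorimeter(e1, e2, alpha, beta):
    """Software model of steps E, S, M, O, P, D for one pair of 20x20 maps."""
    # E: pixel-wise sum of the two detector maps
    e = [[e1[i][j] + e2[i][j] for j in range(N)] for i in range(N)]
    # S: total energy
    s = sum(sum(row) for row in e)
    # M: maximum energy and its index (first maximum wins: an assumption)
    m, im, jm = -1, 0, 0
    for i in range(N):
        for j in range(N):
            if e[i][j] > m:
                m, im, jm = e[i][j], i, j
    # O and P need the index of the maximum, hence the buffered map
    o = 0
    p = 0
    for i in range(N):
        for j in range(N):
            r = round(math.hypot(i - im, j - jm))  # assumed rounding of r_m
            o += r * e[i][j]
            if abs(i - im) + abs(j - jm) < 2:      # p_m[i, j] = 1
                p += e[i][j]
    # D: sign of the discriminant, in the assumed fraction-free form
    d = alpha * s * o - beta * (s - m) * p
    return s, m, (im, jm), o, p, (d > 0) - (d < 0)


if __name__ == "__main__":
    zero = [[0] * N for _ in range(N)]
    hit = [row[:] for row in zero]
    hit[5][7], hit[5][9] = 100, 10
    print(calorimeter(hit, zero, 1, 1))
```

The second pass over the map is what forces the buffering between steps E, M and steps O, P, D: neither `o` nor `p` can be accumulated before `(im, jm)` is known.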

4 The Gbops 

Our only analytical tool so far is definition (1) of the computing power, which is a strictly
digital measure. The exact analog process through which our mathematical computing power
gets physically delivered does not matter here. What exactly is a gate is not important either:
it only affects our measure by a constant multiplicative factor (provided that we keep bounded
fan-in).

Our favorite accounting unit counts as one any operation which is no more complex than a single
bit-serial binary addition. Or subtraction, for that matter; or any gate with at most three inputs,
and one bit of internal state.

Let 1 Gbops be 10^9 binary operations per second, our unit for computing power. It is delivered
by any Bop circuit, operating at 1 GHz.



[Schema of the Bop circuit: inputs a and b, internal state s]



Each Bop circuit, which we call active bit, is made of two boolean functions $S, C \in \mathbb{B}^3 \mapsto \mathbb{B}$
and a synchronous register; they are connected as shown in the schema above, or the 2Z code
which follows.

Bop(a, b) = (s, r)
where
    n = C(a, b, s);    // Next state
    s = reg(n);        // Flip flop
    r = S(a, b, s);    // Result
end where;

Active bits include the bit-serial binary adder, which is obtained by choosing:

$S(a,b,s) = (a + b + s) \bmod 2$,
$C(a,b,s) = (a + b + s) \div 2$.
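As a small illustration of this choice of S and C, the sketch below simulates one active bit fed with two bit streams, least significant bit first. The helper names `bop_adder`, `to_bits` and `from_bits` are hypothetical, and the register is assumed to start at zero.

```python
def to_bits(x, n):
    """The n low-order bits of x, least significant bit first."""
    return [(x >> k) & 1 for k in range(n)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(b << k for k, b in enumerate(bits))


def bop_adder(a_bits, b_bits):
    """Bit-serial addition with a single active bit; the state s is the carry."""
    s = 0                      # register content, assumed reset to zero
    out = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + s
        out.append(total % 2)  # S(a, b, s): result bit r
        s = total // 2         # C(a, b, s): next state n, latched by the register
    return out


if __name__ == "__main__":
    print(from_bits(bop_adder(to_bits(13, 8), to_bits(29, 8))))
```

One call to the loop body corresponds to one clock cycle of the Bop circuit, so an n bit addition costs n binary operations, which is the basis of the accounting rules.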

The accounting rules which follow, for arithmetic and logic operations over n bit word inputs, 
are straightforward: 

+ One $n + n \mapsto n + 1$ bits addition each nanosecond is worth n Gbops. Subtraction, integer
comparison and logical operations are bit-wise equivalent to addition.

× One $n \times m \mapsto n + m$ bits multiplication each nanosecond is worth nm Gbops. Division,
integer shifts and transitive (see [V83]) bit permutations are bit-wise equivalent to
multiplication; consequently, so is an $n \mapsto m$ look-up table LUT, or RAM access.
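The two rules translate directly into a pair of helper functions. This is a minimal sketch; the names `add_gbops` and `mul_gbops` are hypothetical, and rates are given in operations per second.

```python
def add_gbops(n, rate_hz):
    """Cost of n-bit additions (or subtractions, comparisons, logic ops)."""
    return n * rate_hz / 1e9


def mul_gbops(n, m, rate_hz):
    """Cost of n x m multiplications (or shifts, n -> m LUT or RAM accesses)."""
    return n * m * rate_hz / 1e9


if __name__ == "__main__":
    print(add_gbops(16, 40e6))     # a 16b adder clocked at 40MHz
    print(mul_gbops(16, 8, 40e6))  # a 16b x 8b multiplier clocked at 40MHz
```

With these rules a 16b adder at 40MHz is charged 0.64 Gbops and a 16b by 8b multiplier at 40MHz is charged 5.12 Gbops, the two figures that recur throughout the calorimeter analysis.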



5 Calorimeter Analysis 



We now count the number of Gbops required by each step of the calorimeter. 



E The input is composed of four digital flows: 4 × 16b × 20MHz. We must add together the
first and last two flows:

$E_t[0] = E'_t[0] + E''_t[0]$;
$E_t[1] = E'_t[1] + E''_t[1]$.

Each addition requires 16 binary operations per cycle: $P_e = 16 \times 2 \times 20\mathrm{M} = 0.64$ Gbops.

S The input is: 2 × 16b × 20MHz. We sum all energies from the same map: $P_s = 0.64$ Gbops.

M The maximum can be computed by using the sign of the subtraction to select the proper
argument, at a cost of 640 Mbops; together with keeping up to date the 10b index $(i_m, j_m)$:
0.4 Gbops. In addition, the maximum requires storing a complete map, in a 400 × 16b
double access RAM. So we charge 2 × 10 × 16 × 40M = 12.8 Gbops. Total: $P_m = 13.84$
Gbops.

O The first statistical moment is the most complex operation. It requires a $20\mathrm{b} \mapsto 16\mathrm{b}$ look-up
table LUT for finding the distance $r_m[i,j]$: 12.8 Gbops. The 8b multiplication accounts
for 5.12 Gbops. Total: $P_o = 17.92$ Gbops.

P The peak is the cheapest operation; at 5 additions 16b per map, it requires a negligible:
$P_p = 8$ Mbops.

D The discriminant is expensive, 2 × 16 × 48 active bits; it is executed once per map, so the
computing power required here is only: $P_d = 154$ Mbops.
The computing power required by the complete calorimeter computation is the sum
$P_e + P_s + P_m + P_o + P_p + P_d$, namely 33 Gbops.
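The per-step figures can be re-added mechanically. The sketch below simply replays the counts quoted in this section, using kbops and kHz so that all arithmetic stays in exact integers; the dictionary name `budget_kbops` is hypothetical, and the 640 Mbops and 0.4 Gbops terms of step M are taken as stated rather than re-derived.

```python
KHZ_PER_MHZ = 1000

# Computing power per step, in kbops (bits x rate in kHz).
budget_kbops = {
    "E": 16 * 2 * 20 * KHZ_PER_MHZ,                     # two 16b additions at 20MHz
    "S": 640_000,                                       # stated: 0.64 Gbops
    "M": 640_000 + 400_000 + 2 * 10 * 16 * 40 * KHZ_PER_MHZ,  # max + index + RAM
    "O": 20 * 16 * 40 * KHZ_PER_MHZ + 16 * 8 * 40 * KHZ_PER_MHZ,  # LUT + multiply
    "P": 5 * 16 * 100,                                  # 5 additions 16b at 100 kHz
    "D": 2 * 16 * 48 * 100,                             # once per map, at 100 kHz
}

total_kbops = sum(budget_kbops.values())

if __name__ == "__main__":
    for step, cost in budget_kbops.items():
        print(f"{step}: {cost / 1e6:.3f} Gbops")
    print(f"total: {total_kbops / 1e6:.1f} Gbops")
```

The total comes to about 33.2 Gbops, of which steps M and O, dominated by the RAM and the LUT, account for more than 95%.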



Note that our accounting does not take into consideration any of the required data movements: 
from input to processing unit; from processing unit to output. Such transport operations do not 
transform values; they do not directly contribute to the final decision D: They get charged here 
as overhead, exclusively accounted for in the virtual computing power of the implementation 
technology, and not in the actual computing power. 



6 General Purpose Architectures 

To make the analysis simple, we give general purpose technology every benefit of the doubt:
caches are all assumed to be wide enough and fast enough to provide each Cpu with
data and valid instructions, at no latency but the minimum feasible.

Ignore the fact that both data and instruction caches would have to be huge, by 1992's
standards. This lets us perfectly streamline the calorimeter's computation: unroll all loops,
and take one cycle per fetch or store, on every memory access.

Ignore the fact that, in 1992, none of the general purpose machines benchmarked by CERN
could cope with the calorimeter's external input bandwidth of 160MB/s. So, the input had to
be faked in the experiments.

6.1 Single Processor 

In 1992, the highest performance microprocessor has 64b of data, clocked at 200MHz. The 
virtual computing power of this Cpu64b200MHz is 64 x 0.2 = 12.8 Gbops. We know from 
the calorimeter analysis that this processor is not fast enough. 

Let us analyze the number of clock cycles required for running the calorimeter on a reduced
instruction set RISC processor. In a streamlined code, the number of cycles required to process
the calorimeter, for each 16b × 20 × 20 energy map, is:

$C_e + C_s + C_m + C_o = 5 + 2 + 3 + (4 + 16)$ cycles per pixel,
$400 \times (C_e + C_s + C_m + C_o) + C_p + C_d \approx 12.1$ K cycles.

The calorimeter operates at 100 kHz; so, the minimum clock frequency at which we can expect to
run this program is 1.2 GHz. Noyce's thesis predicts that this Cpu64b at 1.2GHz will become
technologically feasible around year 2000.

The moment O gets computed in 4 cycles for the look-up table LUT, and 16 cycles for the
actual multiplication. On a machine with a hardware multiplier, $C_o$ gets reduced to 8 cycles.
So the clock at which we need to operate the calorimeter is only 720 MHz. A Fpu64b720MHz
should be feasible by year 1998.
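The reasoning from cycle counts to feasibility years can be sketched as follows. The function names are hypothetical; the 100 cycles assumed for C_p + C_d are back-computed from the 12.1 K total and are not stated in the text; the yearly frequency gain is taken as the cube root of 2, starting from the 200MHz processor of 1992.

```python
import math

ALPHA = 2 ** (1 / 3)      # yearly frequency gain under Noyce's thesis
MAP_RATE_HZ = 100_000     # one energy map every 10 microseconds
PIXELS = 400              # 20 x 20 map


def cycles_per_map(c_e, c_s, c_m, c_o, c_tail=100):
    """Cycles per map; c_tail stands for C_p + C_d (assumed, see lead-in)."""
    return PIXELS * (c_e + c_s + c_m + c_o) + c_tail


def required_clock_hz(cycles):
    """Clock frequency needed to process one map per sampling period."""
    return cycles * MAP_RATE_HZ


def feasible_year(clock_hz, base_year=1992, base_hz=200e6):
    """First year in which the clock is reached, growing by ALPHA per year."""
    return base_year + math.ceil(math.log(clock_hz / base_hz) / math.log(ALPHA))


if __name__ == "__main__":
    for label, c_o in (("software multiply", 4 + 16), ("hardware multiplier", 8)):
        clock = required_clock_hz(cycles_per_map(5, 2, 3, c_o))
        print(label, clock / 1e9, "GHz, feasible by", feasible_year(clock))
```

Under these assumptions the software multiply needs about 1.21 GHz and the hardware multiplier about 0.73 GHz, close to the 720 MHz quoted above, which leads to the years 2000 and 1998 respectively.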
The virtual computing power of Cpu64b at 1.2GHz is 87 Gbops. The virtual computing power

Figure 3: In this popular 68000 micro-processor, the area of the actual Cpu is less than
1/200-th of the whole

In both cases, the computing power actually expended on the calorimeter is only 33 Gbops.
The respective virtual to actual power ratios are about 3 and 47. Observe that Fpu delivers at
most 4.8 Gbops when it is only computing additions or equivalently cheap operations: a very
low utilization of the computing power virtually available in the multiplier.

Note that large data paths (32b or 64b) past 16b do not help the calorimeter: the whole 
computation can be performed on 16b integers, except for the final decision where 48b are 
convenient. To conclude on single processors, our best fit to the calorimeter are: 

Cpu16b1.2GHz The ratio between the 43 Gbops (equals 19 for Cpu plus 24 for the LUT and
RAM) virtual power and the 35 Gbops actual power is near one. We achieve an optimal
fit where the 16b computed in each cycle all contribute to the calorimeter's decision.

Fpu16b720MHz The ratio between the 189 Gbops virtual power and the 35 Gbops actual
power is near five. The multiplier is used at less than one fifth of its capacity.

Although Cpu16b1.2GHz makes the single processor RISC software solution optimal for the
data path of the calorimeter, it is hiding a large structure (with high Gbops virtual cost) for
handling its hierarchical data and instruction memories. There is a lot more to Cpu16b1.2GHz
than just its Alu16b. If you care to look at Figure 3, the part of interest here is only a tiny
fraction of the silicon area in the full microprocessor.

4 A 64b floating point unit, with 48b mantissa and 16b exponent, which operates at 100MHz delivers 230 Gbops.

6.2 Multi Processors 

The class of massively parallel processors fares poorly on CERN's calorimeter. The strong 
experimental evidence provided in [B&al92] can probably be explained by observing that any 
attempt to process the calorimeter on a pool of slow processors implies a large cost: the amount 
of memory required is proportional to the number k of processors used; so is the bandwidth 
required for communicating the proper data to the proper processing units. 

Massively parallel solutions to the calorimeter are ruled out by economics and engineering
problems. A SIMD machine made of 4k = 4096 parallel 4b processors operating at 12MHz - call it
4kPP4b12MHz - has a virtual computing power of 200 Gbops. Yet, the implied cost in
memory and communication makes it incapable, in 1992, of computing the calorimeter anywhere
near real time.

Processing the six steps of the algorithm independently offers the best room for parallelism
in the calorimeter. Each processor performs some of the steps (E, S, M, O, P, D).

This multiple instruction, multiple data parallel machine operates at the speed of its slowest
component, namely the moment unit O. Using here both a LUT and a 16b multiplier, we reduce
$C_o$ to eight cycles. Each 16b processor is now only required to operate at 40 × 8 = 320 MHz
in order to process the calorimeter. Such a parallel MIMD processor - call it 6PP16b320MHz
- should be feasible before 1996.

Note that the bandwidth required between the 6 processing units is 8 × 80 = 640 MB/s, a taxing
requirement for all general purpose architectures.

Past such a simple six-stage assembly line organization, there is little to be gained through
parallel processing: the increase in storage and communication is not worth the benefit in
effective operations.

7 Special Purpose Architectures 

From the nature of the physical interface of the calorimeter (input on two HIPPI channels 
32b20MHz, output on the host's TURBOchannel 32b25MHz), we know that a minimal size 
electronics has four printed circuit boards (say 6cm x 8cm in 1992): two for input, one for 
output, and one board for the calorimeter algorithm per se, and connecting to the other boards. 
The calorimeter algorithm maps directly into eight functional units: schemas above, 2Z code
below.



Calorimeter({E', E"} : [32]) = D
where
    E = AddUnit(E', E");                 // 100MB/s peak output
    // 159 zeroes and a one, period 200 bits
    c159 = Sdd(2**159 / (1 - 2**200));
    (ij, M) = MaxUnit(E, c159);
    (r, p) = Lut(ij, c159);
    e = Sto(E, c159);                    // Delay 10 μs
    reset c159 do
        S = SumUnit(e);
        P = PeakUnit(e, p);
        O = MomentUnit(e, r);
    end reset;
    D = OutputUnit(P, S, O, M, c159);
end where;

7.1 Hardware Blocks 

We present the function of each atomic unit; when it is relevant, we provide the 2Z code from 
which the PAM configuration for the corresponding unit can be derived. We also analyze the 
virtual computing power required by each step of the PAM implementation. 
The whole design is synchronized with the 160MB/s input, by a 40MHz clock. Both the period
(10 μs) and the delay (20 μs) are kept at their absolute minimal values. Each arithmetic unit is
precisely tailored to its function: 16 bits parallel operators for steps 1 through 5, connected
according to the schematics below. Step 6 is implemented in a fully bit-serial manner, to take
best advantage of the low bandwidth requirement on this final electron/hadron jet decision.

E The input is composed of four parallel digital flows: 4 × 16b × 20MHz. We first interleave
$E'_t[0]$ and $E'_t[1]$ in time, so that $E'_{2t} = E'_t[0]$ and $E'_{2t+1} = E'_t[1]$; similarly, interleave $E''_t[0]$
and $E''_t[1]$ so as to produce $E''$ at 16b40MHz. In PAM technology, each interleave is
realized by a specific column of 16 Pabs⁵, at cost: 2 × 16b40MHz = 1.28 Gbops. We add
together the two flows through a 16b adder at 40MHz. The required computing power
is 3 × 16 Pabs: $P_e = 1.92$ Gbops.

M Computes the maximum M of the current map E at 16b100KHz, and its index $(i_m, j_m)$ at
10b100KHz. In PAM technology, the maximum is best implemented from high to low
bits (see [BVS94]). The computing power of this unit is 2 × (16b + 10b) × 40 MHz =
2.08 Gbops.

Sto Double buffer the current 400 × 16b energy map E while reading the previous one e.
Both flows are 16b40MHz. In our PAM implementation, we use a 2 × 400 × 16b40MHz
double access RAM: 12.8 Gbops, and 400 Mbops to control the addresses. Total:
$P_{sto} = 13.2$ Gbops.

S Sum all energies from the same map, with a 16b accumulator: $P_s = 640$ Mbops.

SumUnit(E : [16]) = S : [16]
where
    R = Add(16)(E, S, 0);
    S = Reg(16)(R);
end where;

P The peak is computed at full cost: $P_p = 640$ Mbops. Bit $p = p_m(i,j)$ is produced by the
LUT.

PeakUnit(E : [16], d) = P : [16]
where
    for k<16 do    // sum when d=1
        F[k] = E[k] & d
    end for;
    P = SumUnit(F);
end where;

5 Programmable Active Bit

O The first statistical moment is implemented by a $20\mathrm{b} \mapsto 9\mathrm{b}$ LUT for finding the 8b distance
$r_m[i,j]$ and the 1b peak $p = p_m(i,j)$. The address control for this RAM uses 800 Mbops.
Including the 8b multiplication, we find:
$P_o = 18.72$ Gbops.

MomentUnit(E : [16], D : [8]) = O : [16]
where
    P = Mul(16, 8)(E, D);
    O = SumUnit(P[0..15]);
end where;

D Each of the M, P, S, O units takes inputs at 16b40MHz and produces 16b of output at 100KHz.
The four outputs (M, P, S, O) are consumed by the decision unit, 4 × 16b at 100KHz, in
order to produce the final decision D at 1b per 100KHz. The PAM implementation of
the discriminant is detailed in Section 8.

Note that the virtual computing power required for our PAM calorimeter is only V = 39 Gbops.
The ratio between actual and virtual power is very near one, as for Cpu16b at 1.2GHz. The
difference is that here, the whole chip area is devoted to the calorimeter computation. There is
no hidden virtual cost for managing PAM data.

From this level of description, we can design and implement a full-custom solution in one chip.
That makes for a relatively empty calorimeter board: a single chip and lots of connectors.

An easier solution is to realize all but the LUT stage in a calorimeter gate-array, and implement
the LUT by a RAM; this is a two-chip implementation of the calorimeter board.

We can implement the calorimeter on a generic PAM board (same size as all others). It is 
composed of two RAM banks, one FPGA, two input connectors and one output connector. It 
can be ready made from off the shelf components. 

The only difference between our three boards is their cost per unit. All are functionally 
equivalent calorimeter implementations. They also have equal performance. 

7.2 Programmable Active Memories 

As our reader is not assumed to be familiar with this technology, let us survey some of the 
concepts in this new emerging field. The following is from [BRV89]: 
Definition 1 (PAM) A PAM is a uniform array of identical cells all connected in the same
repetitive fashion. Each cell, called a Pab (for Programmable Active Bit) is configurable
enough so that the following holds true: any synchronous digital circuit can be realized,
through suitable configuration of each Pab, on a large enough grid and for a slow enough
clock.

A Pab is the basic building block out of which FPGAs are built. There are many ways to 
construct a Pab which has the required generality. The FPGAs from [A90], [C&al86], [C92] 
and [X91] present four rather different implementations of the concept. The Bop circuit from 
Section 4 provides another example of universal Pab. 

It should be pointed out that the five Pab structures mentioned so far do not exactly have the 
same computing power: while it only takes one of either [X91] programmable active bit to 
implement a serial adder, it takes two of [C&al86] and four of [A90] or [C92] to realize the 
same function. Such factors must be accounted for in the detailed analysis of their virtual 
computing power. With our Pab=Bop choice, we simply count one binary operation per Pab. 

It takes quite a bit more than our PAM definition to obtain a workable and powerful general
purpose digital engine. The most important design issues involved are thoroughly discussed
by P. Bertin in [B93]. Besides ours, which were built at INRIA and DEC-PRL, other success-
ful PAM implementations have been reported, in particular at the Universities of Edinburgh 
[KG89], Zurich [BP92], and at Maryland's SRC [ABD92]. Let us also mention [Q91] which 
is a large PAM, dedicated to hardware emulation. 

The ratio between the theoretically available computing power, and that practically usable for
the calorimeter is much lower for dedicated hardware than for general purpose solutions. PAM
technology combines the best from both:

• being a universal virtual machine, the PAM can be configured to a wide class of computing 
units. As software, it is by no means limited to processing a single application. 

• being configurable at the gate and wire level, a properly dimensioned PAM can emulate 
efficiently each special purpose hardware. A fixed size PAM, say 16x20 Pabs at 40MHz, 
has some well defined virtual computing power: 12.8 Gbops. With some design effort, 
it was found on a large number of test cases that such PAM can simulate in real time, 
any specific dedicated synchronous hardware whose computing power is less than 12.8 
Gbops. 

• the benefits derived from processing the calorimeter through special purpose hardware are 
large; they are representative of a wide class of applications, for which PAM technology 
provides today an optimal implementation medium. 

We demonstrate in [BRV93], through 10 benchmarks which cover a wide range of applications,
drawn from arithmetic, algebra, geometry, physics, biology, audio, video and data compression,
that our implementation vehicle DECPeRLe-1 consistently performs in the 100 Gbops range⁶.

8 PAM Programming

By its nature, the PAM has led us to implement hardware algorithms which are substantially
bigger than anything yet attempted on single silicon chips: on the order of 250K gates (or 2M
transistors), excluding RAMs. The sheer size of such designs has forced us to aggressively
pursue the strictly synchronous design paradigm, throughout the PAM implementations reported
in [BRV93].

It has quickly become clear that arithmetic circuits are the key to success in this area. Obviously,
each of our implementations only requires a finite arithmetic precision. However, any design
system which claims to cover the whole spectrum (from 1b to 4Kb!) requires the ability to
handle truly arbitrary precision arithmetic.

The natural mathematical domain into which this leads us is that of the 2-adic numbers, both
discovered and created by K. Hensel around 1900. In [V93], we uncover some of the intimate
relationships which exist between digital synchronous circuits and 2-adic numbers. Capitalizing
on these results, we attempt in [BVB94] to introduce a new programming language, named 2Z, whose
main function is to help concisely define synchronous circuits.

The most classical features of 2Z have already been illustrated through examples, since the 
beginning of this paper. Let us complement them by the source 2Z code for the decision unit 
D. It fully exploits the facilities for bit serial arithmetic synthesis, which are quite unique to the 
2Z language. 




The 2Z code corresponding to the above schema is: 

OutputUnit({P, S, O} : [16], M, c159) = D
where
    s = ParSer(16)(S, c159);
    enable c159 do
        O' = Reg(16)(O);
        P' = Reg(16)(P);
    end enable;
    u = sMul(16)(s, O');
    l = sMul(16)(s - m, P');
    // serial arithmetic synthesis
    d' = (u * a) - (l * b);
    // CERN's constants
    a = 134535;
    b = 767665;
    // controls
    enable Fin do
        D = reg d'
    end enable;
    reset c159 do
        Fin = Sdd(2**48);
    end reset;
end where;

See [BVB94] and [V93] for more details.

6 Note that the present definition of the Gbops is only one half of that used in [BRV89]; the aim is to simplify
out useless constant factors, as one serial bit of addition now amounts to one bop, no longer two as in [BRV89].

9 Conclusion 

Under Noyce's thesis, we have established the following. 

I The computing power of a single fast Cpu16b or Cpu64b will grow by a factor of two every three
years. It should reach the 1.2GHz frequency, required for implementing the calorimeter,
before year 2001. By then, the computing power actually delivered on the calorimeter
will still be 33 Gbops.

II The computing power available in an FPGA will grow by a factor of eight every three years.
Let us pick as a starting point 400 Pabs at 40MHz in 1992. By Noyce's thesis, the
corresponding figures by year 2001 should be 25.6K Pabs and 320MHz: 8 Tbops per
cm²!
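The FPGA projection follows from applying the yearly rules G' = α²G and F' = αF with α³ = 2. A minimal sketch, in which the function name `extrapolate` is hypothetical and the exponents are written as powers of two so that multiples of three years come out exact:

```python
def extrapolate(gates, freq_hz, years):
    """Project gate count, clock and computing power `years` ahead.

    Uses alpha**3 == 2, so gates grow by 2**(2*years/3)
    and the clock frequency by 2**(years/3).
    """
    g = gates * 2 ** (2 * years / 3)
    f = freq_hz * 2 ** (years / 3)
    return g, f, g * f


if __name__ == "__main__":
    g, f, p = extrapolate(400, 40e6, 2001 - 1992)
    print(f"{g / 1e3:.1f}K Pabs at {f / 1e6:.0f}MHz: {p / 1e12:.2f} Tbops")
```

Over the nine years from 1992 to 2001 the gate count grows 64-fold and the clock 8-fold, hence the 512-fold gain in computing power for the FPGA, against the 8-fold gain of a processor whose width stays fixed.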
So, by year 2001, a single chip FPGA will have 200 times the computing power of the fastest
sequential Cpu. A small PAM will be three orders of magnitude more powerful than a single
Cpu64b at 1.2GHz. An equivalent way to look at this: in 1992, a 16 × 20 FPGA at 40MHz has
the computing power of a vintage 2001 Cpu64b, at 1.2GHz. Two things are clear.

1. PAM technology will inevitably become an important contributor to the high power 
scientific computations, before the turn of the millennium. 

2. General purpose computers will have to become multi-processors, with a relatively large 
number of processors, in order to sustain the competition. 




Figure 4: The future of two technologies? (time axis: 92, 95, 98, 01)

From our experience, it is clear that the main obstacles to the development of PAM technology
come from the current state of computer aided design (Cad) systems: it is much harder to program
a PAM, than it is to write code for a serial Cpu.

The Cad tools all run on sequential Cpus. The computing power available to run the Cad tools 
does not currently scale along with the size of PAM technology. 

The 2Z language is a small step towards meeting such Cad challenges. Orders of magnitude 
must be gained over the current design techniques, in order to implement the truly huge PAM 
designs for year 2001. We now deal with 10K Pabs; by then, we shall have 1M gates to design, 
place and route. 

Based on our observations, it is tempting to venture the question: 

What computing power will be available in a shoe size box by year 2001? 

According to the theories exposed here, and taking 256 Gbops as a reference point for PAM 
technology in 1993, we predict: 

68 Tbops! 